Interpretable neural architecture search and transfer learning for understanding CRISPR/Cas9 off-target enzymatic reactions

Finely-tuned enzymatic pathways control cellular processes, and their dysregulation can lead to disease. Creating predictive and interpretable models for these pathways is challenging because of the complexity of the pathways and of the cellular and genomic contexts. Here we introduce Elektrum, a deep learning framework which addresses these challenges with data-driven and biophysically interpretable models for determining the kinetics of biochemical systems. First, it uses in vitro kinetic assays to rapidly hypothesize an ensemble of high-quality Kinetically Interpretable Neural Networks (KINNs) that predict reaction rates. It then employs a novel transfer learning step, where the KINNs are inserted as intermediary layers into deeper convolutional neural networks, fine-tuning the predictions for reaction-dependent in vivo outcomes. Elektrum makes effective use of the limited, but clean in vitro data and the complex, yet plentiful in vivo data that captures cellular context. We apply Elektrum to predict CRISPR-Cas9 off-target editing probabilities and demonstrate that Elektrum achieves state-of-the-art performance, regularizes neural network architectures, and maintains physical interpretability.


Introduction
Enzymatic reactions control the rate at which cellular processes occur.Proteins are broken down, energy is stored, and DNA is replicated thanks to the finely tuned enzymatic pathways of our cells.Dysregulation of these pathways can lead to various diseases, making the study of them an important part in biomedical research [1][2][3][4].However, these reaction pathways can be extremely complex, traversing many intermediate states with unknown kinetic rates before obtaining a final product or event [5][6][7][8].This makes discovering and testing biochemical kinetics models an arduous task.This task is further complicated by the fact that the cellular environment is complex and introduces confounding factors.To better understand the chemical kinetics of our cells, we require new analysis tools.
Deep learning is now a commonly used tool in the biological sciences, being applied to tasks such as analyzing images, predicting protein structures, and processing genomic data.Neural networks (NNs) are especially effective at finding patterns in genomic datasets [9][10][11][12][13][14].This is because NNs train efficiently with gradient descent and yet can have great predictive power with little prior knowledge.However, given NNs' generic structure and large number of trained parameters, it is difficult to gain deeper scientific insights or generalize to systems outside of the datasets on which they were trained.
Interpretable neural networks (INNs), on the other hand, incorporate prior knowledge and/or can be examined to extract more information on the systems they model.Examples of INN include free-energy architectures that map exactly to a partition function-like formula.Upon training, the network node parameters are the chemical reactions' sequence-dependent equilibrium constants.Free-energy neural networks have been applied to phenotypic expression assays and deep-mutational protein scanning assays [15][16][17][18].A less explored, yet more informative INN architecture is the Kinetically Interpretable Neural Network (KINN).KINNs correspond to a set of linear ordinary differential kinetic equations for the relative occupation of enzymatic states, and whose coefficients are state transition rate constants [15].These learned rates allow prediction of more complicated biochemical scenarios, like non-steady-state systems and multiple species reactions, by calculating the kinetic system's eigenvalues(in preparation).
Unfortunately, KINNs and free-energy neural networks suffer several limitations, restricting wide adoptions.First, they require a candidate kinetic model before training, losing the advantage of needing minimal information when creating the machine learning model.Furthermore, training datasets are restricted to data generated by specific in vitro assays [19][20][21].While in vitro data harbors little noise or confounding effects compared to their in vivo counterparts, it is challenging to evaluate how well in vitro results and predictions apply to in vivo scenarios.Finally, in vitro kinetic and free-energy datasets are less common and smaller, making it harder to train and validate machine learning model results.Even though in vitro trained models have shown promise, these challenges undermine their performance when applied to understanding in vivo settings, which is the ultimate goal.
We introduce a new deep learning framework called Elektrum that generates datadriven mechanistically interpretable networks for in vivo kinetic systems by integrating data from in vitro and in vivo assays.One of Elektrum's key innovations is its transfer learning step where instances of a kinetic model ensemble are tested as intermediate layers of a deeper convolutional neural network (CNN).This leverages the power of transfer learning that repurposed a trained model architecture to a new but related task, enabling effective use of both the limited but clean in vitro data and plentiful but complex in vivo data.Our transfer learning method is distinct from conventional transfer learning because we train a smaller NN on a specialized dataset to achieve interpretability, then gain deeper insight by augmenting the network with a contextaware CNN backbone.
We show this approach regularizes the NN architecture while maintaining interpretability, even for the ultimate challenge of predicting in vivo processes.We applied Elektrum to model CRISPR-Cas9 off-target editing kinetics.Existing CRISPR/Cas9 predictors fall into one of two classes: accurate, data-driven machine learning models or expert-derived, mechanistic biophysical models.Our model is the first to overcome the trade-off of predictability versus mechanistic interpretation by achieving both stateof-the-art performance for predicting in vivo off-target editing and providing physical interpretability of the enzymatic kinetic scheme.

Overview of Elektrum Framework
Elektrum is a deep learning framework for modeling in vitro and in vivo sequencedependent enzymatic kinetics (Figure 1).By utilizing neural architecture search (NAS) [22] and transfer learning techniques, Elektrum automatically discovers a system's underlying kinetics by generating and testing different KINN architectures.While searching, Elektrum creates a set of candidate KINNs that predict a measurable kinetic rate in vitro.Leveraging the resulting ensemble of possible kinetic models, we then employ a second transfer learning NAS step to search for a hybrid CNN+KINN model.This second step extends the KINN architecture by using a more complex CNN backbone to learn nuanced sequence-dependent signals hidden in the in vivo data.The NAS incorporates pretrained KINN models as an intermediate layer, selecting for one of the KINNs that maximizes in vivo model performance.
When searching the KINN architecture space, Elektrum efficiently builds multiple candidate kinetic models in a data-driven fashion, in contrast to a single kinetic system derived by human experts.Briefly, Elektrum defines a set of prior distributions  We use a probabilistic model-building genetic algorithm to search for and create a set of Kinetically Interpretable Neural Networks (KINNs), a special type of sparse neural networks that has one-to-one correspondence with kinetic models.These KINNs differ in number of enzymatic states and reaction rates but model the in vitro data equally well.B) KINNs can be translated to a system of first-order rate equations.For CRISPR-Cas9, this corresponds to a multi-step enzymatic reaction, where the kinetic rate for each enzymatic step is determined by the DNA-gRNA sequence pair.C) Transfer learning integrates KINNs pretrained on in vitro data with a deep neural network to predict in vivo Cas9 cleavage.We employ neural architecture search (NAS) to optimize the convolutional and bidirectional layer while searching for an optimal static KINN layer.This integrative approach expands the types and amount of applicable training datasets by many fold while maintaining kinetic interpretability.
over the KINN model architecture and updates the posterior probabilities based on performance (Methods).A candidate KINN's convolutional first layer samples DNA-gRNA subsequences to model the reaction-rate/sequence dependency.Elektrum then gathers the kinetic rates between enzymatic states and employs a 3-layer King-Altman neural network to calculate the steady-state production rate of an experimentally measurable quantity, such as cleavage products.Alternatively, one may use a differentiable eigen-decomposition layer to model non-steady-state kinetic system dynamics, which is important for understanding protein binding and cleavage in experimental settings (Methods).The KINN's performance on a held-out dataset is used to update the posterior model architecture distributions following a Bayesian probabilistic genetic algorithm (Methods).When benchmarked on simulated kinetic systems under a variety of different settings, we show that our search algorithm identified substantially more accurate models compared to random sampling (Extended Data Figure 1).Importantly, KINNs have the one-to-one correspondence with a set of linear ordinary differential equations (ODEs), therefore enabling direct discovery of biophysical knowledge.This is in contrast to conventional model interpretation and feature attribution methods, where deriving a set of kinetic ODEs from the feature importance is non-trivial.Given a set of pretrained KINNs, it is crucial to evaluate how these in vitro kinetic models apply to in vivo reactions.This evaluation is challenging because in vivo enzymatic processes are more complex, and may not have directly measurable products.Two innovations make this evaluation possible.First, pretrained KINNs are used as special neural network layers, such that in vitro kinetic systems with different numbers of states and kinetic rates are searched to optimize the in vivo model's performance.Second, Elektrum accounts for perturbative effects to the kinetic rates from in vivo sequence context by adding a deep CNN backbone.This integrative approach expands the types and amount of applicable training datasets by many fold while maintaining kinetic interpretability.As we demonstrate below, the transfer learning scheme substantially improves the final model performance.

Modeling in vitro Cas9 off-target cleavage by KINN search
We apply Elektrum to model the enzymatic pathway of the Cas9 protein as it binds and cleaves a DNA sequence (Figure 1B).When charged with a 20 nucleotide (nt) guide RNA molecule (gRNA), Cas9 enzymes will attach to a complementary DNA target that it subsequently cleaves.However, partial sequence matches may also be cleaved, though typically at lower rates, which are demonstrably detrimental to the safety of gene editing therapies [23][24][25].Therefore, the cleavage rate dependence on gRNA-DNA match poses an important kinetic learning problem for off-target editing [26][27][28][29][30].To learn Cas9 cleavage kinetics, we utilized the only published in vitro Massively Parallel Kinetic Profiling (MPKP) dataset of Cas9 cleavage [21].Two gRNAs were profiled and their time-resolved concentrations of cut DNA molecules were measured from a library of substrates with saturated mutations and indels.This experiment measured their absolute cleavage rates by fitting a single exponential rate function for each gRNA-DNA pair.
Elektrum optimizes KINNs to accurately predict Cas9 cleavage rates by first creating a KINN model space -the number of intermediate kinetic rates, rate dependencies, and enzymatic states -using an expert-curated kinetic model [26,27] as a weak prior (Methods).Elektrum leverages a genetic algorithm to sample KINNs from a model space with prior probability centered around expert knowledge of a four-state kinetic model reported previously [26].Note that this is different from KINNs being initiated from the expert-curated kinetic model.The model's predictive power improved as Elektrum iteratively optimized the KINN architectures (Figure 2B).Elektrumoptimized KINN performed 25.2% better in test Pearson correlation than the baseline randomly sampled CNNs on the same feature and label set (Extended Data Figure 1).Our searched KINNs were highly consistent with the expert-curated model (Extended Data Table 1).More impressively, these KINNs perform on par with the state-of-theart accuracy of non-interpretable deep CNNs identified by neural architecture search [22] (Figure 2B).And, besides providing interpretability, the optimized KINNs require a factor of 2.4×10 2 fewer parameters compared to traditional CNNs (Figure 2C).This reduction in parameters also translates to faster training time compared to traditional CNNs, especially for larger batch sizes (Extended Data Figure 2).
We explored different candidate high-quality kinetic systems with different numbers of enzymatic states.To integrate multiple different kinetic models, we analyzed the architectures' posterior distributions using a Bayesian hierarchical multivariate approach, thus retaining the variability between models (Methods).Given a fixed number of states, the model architectures discovered by Elektrum were highly consistent between the gRNA1 and gRNA2 datasets (Extended Data Figure 3).For both datasets, when exploring more complex models, increasing the number of states from 4 to 5 gives a splitting of the first enzymatic state into two sequential states.Similarly, when going from 5 to 6 states, the last state is split into two sequential states.This suggests that Elektrum is discovering more complex free-energy landscapes that are being coarse-grained at fewer states, especially at the reaction's first and last stages.All models with different numbers of states (4, 5, or 6) performed comparably well on the in vitro data (Extended Data Table 2).This exploration of state number constructs a natural model ensemble for our in vivo transfer learning step, from which one KINN architecture and its pretrained kinetics weights will be selected based on the sequence and cellular context provided by in vivo data.

in vitro KINN models partially capture in vivo off-target editing
Precise prediction of off-target cleavages is critical for effective Cas9 deployment for basic research and safe therapeutic use in humans.Many experimental approaches, such as GUIDE-Seq, CIRCLE-Seq, SITE-Seq [23,[31][32][33][34][35], are designed to measure cleaved off-target sites in living cells for specific time periods and concentrations of gRNA-charged Cas9 RNP.These data are key for developing biological insight and for developing predictive methods, but on their own do not provide a comprehensive framework for predicting off-target cleavage frequency and time dependencies.Here, we use these in vivo test datasets as holdouts to evaluate in vivo performance of KINNs trained on in vitro data to predict Cas9 off-targeting cleavage rates.[30] 0.329 0.317 0.254 AttnToMismatch CNN [36] 0.071 0.025 − Elevation-score [31] 0.131 0.078 − CFD [32] 0.066 0.030 − Ensemble SVM [37] 0.113 0.048 − CNN std [38] 0.115 0.034 − CRISPRoff [39] 0.104 0.046 − Baseline 0.00056 0.00014 0.00028 Table 1 External in vivo data performance comparisons.In this work we proposed three new methods: KINN is an architecture-optimized kinetic model using in vitro MPKP data, while AMBER-CNN is an architecture-optimized deep CNN model trained with the same feature and label set as KINN.Elektrum is a hybrid KINN+CNN optimized model, trained with our novel transfer learning strategy.Existing machine learning methods are trained using various training datasets and are detailed in the corresponding publications.Baseline of AUPR is the proportion of positive labels in a test dataset, the expected AUPR value from a random classifier.
We assessed the sequence cleavage predictions by comparing predicted on-and off-target cleavage rates to non-edited sequences.First, we found a strong positive correlation between the predicted cleavage rate and off-target editing levels (as indicated by the observed high-throughput sequencing read coverage; Extended Data Figure 4).The KINN model accurately predicted the cleavage probability of off-target sites in in vivo data when compared to conventional state-of-the-art Cas9 off-target predictors; only CRISPR-Net, which leveraged a 158 times larger training dataset (training data size 1,112,198 for CRISPR-Net, 6,976 for KINN), outperformed the KINN model (Table 1).When trained on the same in vitro data, the KINN's in vivo predictions even outperformed CNNs despite KINN's slightly worse performance on the in vitro data (Figure 2).This suggests that in the presence of data extrapolation and domain shifts, like the in vitro training shifted to in vivo testing, a mechanistic model could be more robust than a complex neural network.
Furthermore, by having a learned underlying mechanistic model, KINNs can be generalized to systems outside of the parameters inherent in the training data, providing a framework to design improved protocols for genome wide editing.Conventional predictors, on the other hand, are limited to the specific in vivo experimental parameters in the training data which are at a fixed time exposure and concentration of Cas9.Thus these predictors are unable to model temporal dynamics of Cas9 activity, which is important for understanding and controlling editing specificity [40].

Conv1D
Identity Fig. 3 Integration of transfer learning in NAS for in vivo cleavage prediction.A) Optimal deep learning model architecture.The model consists of a convolution layer with small kernel sizes, a complex bidirectional recurrent layer, and a KINN layer.The output of KINN layer's cleavage rate is mapped to a binary label of edited vs unedited off-target sites in vivo.B) The selection probability of each layer for the optimal model from a comprehensive model space.Error bars represent the 95% confidence interval from five independent search runs.Colors are matched to the layers in panel A. Abbreviations: conv k{1,3,5}, convolution with kernel size of 1, 3, 5; conv d4, convolution with kernel size=3 and dilation rate=4; conv lin, convolution with kernel size=1 and linear activation function.bidirection {32,16}: bidirectional RNN layer with 32 or 16 hidden units.kinn {ij}: searched KINNs with i enzymatics states trained on the j-th in vitro gRNA dataset.

Novel integration of transfer learning with architecture search boosts in vivo cleavage prediction
We investigated if we could augment the KINN's ability to predict Cas9 cleavage kinetics in vivo while maintaining its mechanistic interpretability.To better capture Cas9 cleavage kinetics in vivo, we employed Elektrum's transfer learning strategy that incorporates a trained KINN with a convolutional backbone (Figure 3).This convolutional backbone is built by a CNN architecture search algorithm [22] to account for sequence context effects that are not captured in the in vitro-learned kinetic rates (Methods).Furthermore, multiple candidate KINNs are searched to automatically determine the optimal kinetic model that renders the most predictive power for the in vivo datasets.Thanks to the full interpretability of KINNs, the convolutional backbone only needs to learn a modifier term for in vivo-specific effects while keeping the pretrained weights of in vitro KINNs frozen.While fine-tuning KINN weights may improve the final model's performance, it can complicate the interpretation of the original KINN parameters and reduce interpretability.With that said, there is an argument to enable fine-tuning in KINN in our code implementation, making it easy to adapt for other use cases.Finally, we constructed a model ensemble by running the (stochastic) NAS multiple times (n=5).We applied this transfer learning strategy to enable the integration of the mechanistic in vitro trained KINN models with the sequence context learned from large in vivo data collections [30].Our transfer learning architecture search substantially improved the in vivo performance of cleavage probability prediction as compared to the base KINN trained on in vitro data alone.The Elektrum model ensemble improves testing performance across the validation (Extended Data Figure 5) and all external datasets; this approach even outperformed the current non-interpretable state-of-the-art predictors in 2 out of the 3 test datasets (Table 1).Surprisingly, we found that, from a comprehensive space of possible CNN models, the data-driven NAS consistently selected (across all independent search runs) simpler architectures.The search algorithm was trained to select from variable kernel sizes for two convolution layers and included a specialized Inception layer [41].This was followed by a selection between a simple and a complex bi-directional Long-Short Term Memory (LSTM) layer, and between a standard and a self-attention flatten layer (see Methods for details; Figure 3 and Extended Data Figure 6).This broad set of architectures covers most previously proposed expert-designed CNNs [9,11,30,42].The searched optimal architectures for each of the five runs are detailed in Extended Data Table 3.As a result, the simpler architectures selected by the datadriven NAS used the first convolution layer with smaller kernel sizes instead of the more sophisticated Inception layer chosen by human experts in ref. [30].Furthermore, the second convolution layer, common in human-designed CNNs [9,42], tended to be replaced by an identity layer and thus removed by NAS.In contrast to simplifying convolution layers, the NAS found it important to select the complex over the simple bi-directional LSTM layer (Figure 3), suggesting the potential existence of long-range interactions in this biological system.

Interpreting importance of sequence context for in vivo
Cas9 cleavage rate prediction A unique advantage of Elektrum, besides superior performance, is its prediction of the physical in vivo kinetic rates.This enables us to assess which sequence contexts are important for Cas9 in vivo editing by comparing the full Elektrum in vivo model predictions versus the KINN in vitro predictions (Figure 4).We found a general strong positive correlation between the in vitro cleavage rates and in vivo predicted cleavage probabilities, suggesting that in vitro kinetics is relevant to the in vivo context.However, despite the overall positive correlation between in vitro and in vivo rates, in vivo rates were generally higher than in vivo rates (blue, green and orange boxes in Figure 4A).The underestimation for cleavage rate in vitro was consistent across two external datasets (Figure 4A and 4B), suggesting the involvement of particular sequence elements in in vivo cleavage activation.
To understand the sequence determinants underlying these differences between in vivo vs in vitro predicted Cas9 cleavage rate, we assessed the context-specific base-pair importance using SHAP [43].SHAP is an explainable AI technique based on game theory that quantifies each feature's contribution to a model's prediction (Methods).SHAP analysis here indicates that the presence of guanines in the Cas9 target regions is important to explain the increased cleavage seen in vivo (Figure 4).Across all analyzed sequences, guanines had significantly higher SHAP values compared to other nucleotides in the target region (Extended Data Figure 7).This suggests that guaninerich off-target regions are more likely to be cleaved in vivo, a phenomenon that is concordant with previous reports [44].

Discussion
In the biological sciences, developing quantitative understanding and building predictive mechanistic models requires data from both purified systems and cellular data.In purified in vitro systems, most components are known and their interactions are simpler to understand, but the assays lack relevant cellular or genomic context.On the other hand, in vivo cellular data contains this context and is often more plentiful, but that context and measurement uncertainties make it difficult to identify underlying mechanisms and confounding effects.Biophysical and kinetic models can elucidate underlying mechanisms, while deep learning models are powerful tools and need little prior knowledge for making predictions.By combining KINN ensemble generation with NAS-selected transfer learning, Elektrum leverages both data sources to generate interpretable mechanistic models that also integrate genomic and cellular context.
The overall utility of this approach is demonstrated by Elektrum's performance compared to more traditional state-of-the-art NN models.Elektrum is both interpretable and predictive because of the information encoded in the basic physicsinformed structure of the KINNs.Furthermore, in allowing the NAS to choose from an ensemble of high-quality KINNs, we can assess and fine-tune over this ensemble of kinetic models in a data-driven manner using a transfer learning approach.
Our method was designed to rapidly hypothesize an ensemble of kinetic models and then apply said models with little prior knowledge to in vivo processes.The interpretable NAS is efficient because KINNs are backwards differentiable and train faster than other parameter optimization methods [15,45].However, because in vivo systems are affected by biological factors not present in in vitro systems, accurate predictions depend on the learned enzymatic reaction being a key step in the in vivo biological process.For example, we assume the kinetics of DNA cleavage is the dominant process for predicting Cas9 editing in vivo.If this assumption is violated, the model should require a larger CNN backbone to compensate for processes not encapsulated in the enzymatic reaction.However, Elektrum tended toward lower complexity CNNs during NAS (Figure 3), supporting the idea that the cleavage pathway learned by KINNs is paramount in in vivo editing.
Elektrum provides a powerful tool for prioritizing gRNA targets to enable safe and effective genome editing.Given a gRNA sequence, Elektrum predicts the cleavage rates for both on-target and off-target regions.The advantage of Elektrum is its state-of-theart predictive performance as well as its full interpretability.However, Elektrum still makes false positive and false negative predictions when using in vivo experimentally measured off-targets as gold standards.We performed the SHAP analysis for the top false postive and false negative predictions of the in vivo model (Extended Data Figure 8) and in vitro model (Extended Data Figure 9).The false negatives (experimentally measured off-targets but predicted unedited) have multiple mismatches, sometimes even on the PAM sites, suggesting additional factors that may have facilitated the in vivo cleavage of these loci.
The interpretability of the KINN models lends biophysical insights into possible contributing factors to inaccurate predictions.For example, in Moreb and Lynch [46], the authors argue that the lack of model generalization between species is caused by the variability of Cas9/gRNA "search spaces", the locations and states enzymes must explore to find correct target sequences.Equipped with sequence dependent kinetics, the search space can be defined as an organism's genome modeled as a bath of various DNA substrates.As will be presented in future work, off-target sites with large Cas9/gRNA dwell times sequester the protein complexes from solution, slowing down the cleavage rate of target sites; a phenomenon seen previously but not fully characterized from experimental work [47].Furthermore, unlike conventional machine learning models, these KINN-informed biophysical models can easily include known processes like protein or RNA degradation, further changing Cas9 cleavage efficiency in an off-target concentration dependent manner.Such considerations help unify the discrepancies between species specific predictions.Still, more studies are required to elucidate other important in vivo-specific effects, such as epigenetic modification, accessibility of genomic regions, and cell line-specific effects.
Elektrum's method for discovering reaction kinetics only requires sequence-like input (amino acid sequence, DNA sequence, epigenetic marks, etc.) that can be perturbed and leads to different observable and time-dependent outcomes.Massively parallel kinetic profiling assays [21,48] are prime candidates for in vitro data for training KINNs.As more MPKPs are published, the Elektrum framework will become a common tool to quickly discover kinetic models and predict related biological pathways.In the future, Elektrum could be extended to other protein and irreversible biochemical reactions such as the kinetics of restriction enzymes, proteases, motor protein ATP cycles and motion, and ribosomes.

Kinetic theory and King-Altman diagrams
Modeling enzymatic reactions gives us a way to predict the production, degradation, and transition of biochemical components into different states.Here we first describe the theoretical foundations for chemical kinetic theory and then translate governing equations to a neural network optimization problem.
A linear multi-step enzymatic reaction can be defined using a system of equations in the form of where k αβ is the transition rate constant from enzymatic state α to β and S α is the concentration of reactant or product in state α.The values of reaction rates depend on a discrete sequence vector x.Equation ( 1) can be represented as a graph G = (V, E) where vertices V ⊆ {S α } and edges E ⊆ {k αβ |S α , S β ∈ V }.All reactions of this kind then exist in the space G = {G}.Our goal is to search G to determine the correct G for a reaction given that the system is in a pseudo-steady state and that there are measurable products from specific reactions kαβ , where kαβ is a subset of reactions that lead to an experimentally measurable quantity.Equation ( 1) can be written more succinctly by defining the matrix and, for convenience, normalizing the concentration vector which has the solution (5) where s ∞ is the normalized steady-state solution, i.e., K • s ∞ = 0. We can solve for components of s ∞ using the King-Altman (KA) method [49].
The King-Altman method uses Cramer's rule to show that where where the product is over all the rates in the ℓth KA diagram.KA diagrams are acyclic subgraphs of G that contain edges one less than the number of all vertices in G [50].Also, all directed edges (rate constants) must be a part of a path that leads to the same vertex (state) in that diagram.The notation ℓ → α means that KA diagram ℓ ends on state α.Given an arbitrary K, we find all KA diagrams using the algorithm put forth by Lam and Priest [51].
There exists a subset of reactions with rate constants kαβ ⊂ k αβ that map to an experimentally measurable quantity y.The sum of all the production rates is called the 'activity' denoted as ν: ν where only certain state changes, i.e., α → β ∈ act, contribute to activity and map to an experimental measurable with the equation y = σ(ν), where σ(x) is a onedimensional monotonically increasing function with domain and range ∈ [0, ∞).The model search space includes all possible kαβ and σ so that G ∈ G, { kαβ }, {σ} .

Construction of kinetically interpretable neural networks from the King-Altman formalism
Kinetically Interpretable Neural Networks (KINNs) are constructed as a more generalized framework based on the previous work [15].Given the number of enzymatic states, we focus on optimizing KINN's hyperparameters that determine the mapping of sequence to kinetic rate constants.We perform a separate analysis for each specified number of enzymatic states and associated transition rates, with the number of states ranging from 4 to 6. Automatic exploration of viable multistate models is left for another paper.A schematic of a KINN architecture and the associated 4-state kinetic model for Cas9-DNA cleavage is shown in Extended Data Figure 10.
The first KINN layer aims to learn the mapping of sequence x to kinetic rate constants k αβ , ∀{α, β}.We note that kinetic rate constants are specific variables describing a reaction independent of reagent concentrations (as opposed to timedependent kinetic rate), although the term 'constant' may cause potential confusions for readers unfamiliar with this concept.We assume a rate constant k αβ is dependent on a subsequence of x, denoted by x i:j ∈ R (j−i)×4 , the one-hot encoded subsequence from the i-th to the j-th nucleotide on the input sequence x i:j .We use a single-filter convolutional layer, with kernel W αβ ∈ R d×4 and bias term b αβ , to learn k αβ : For each k αβ , we search the hyperparameters including the sequence determinant index i, j and the convolution kernel size d.When d = 1, the operation simply computes a weighted sum of each gRNA-DNA nucleotide pair's contribution.When d > 1, it can model the local neighbor interaction effects within the window.When d = (j−i), the kernel length is equal to the input sequence length, and the convolution layer becomes a fully connected layer.
The second KINN layer nodes correspond to the KA diagram terms κ ℓ in equation (7).The first and second layers are connected by a static(untrained) binary layer B with the connectivity determined by the transitions in each KA diagram described previously [51].An example of KA diagrams and associated binary matrix for a 3-state cas9 system is shown in Extended Data Figure 11.The summation of rates in log-space implies a multiplication in linear-space.Therefore, equation ( 6) can be re-written by applying a softmax activation to the product of the rate constant layer with B Finally, the effective cleavage rate is computed from equation (8).In our Cas9 massively parallel kinetic profiling assay, this is the observed cleavage rate.To calculate activity, exponentiated log-rate nodes' values from the first layer multiply the enzymatic state before the activity-producing transition.If multiple rates contribute to activity, then all activity rates are included according to equation (8).A final activation function or mapping may be applied to the activity node if a measurement of the activity-measurement relation is known to be non-linear.
where the non-linear activation functions σ ν , σ o , and dimensions of W t , can also be searched.

Probabilistic model building genetic algorithm for searching KINN
We employ a genetic algorithm to search for KINN model architectures.To incorporate existing knowledge and prior beliefs about a kinetic system, we use a Bayesian genetic algorithm based on probabilistic model building [52].For each kinetic rate constant, we first define a prior probability distribution for its hyperparameters: the sequence determinant i, j and the convolution kernel size d.Then we sample model architectures and train the weights to evaluate performance.The search algorithm updates the posterior distribution by the architectures of each generation's surviving models.Following the notations of k αβ and its sequence determinants x i:j , as an uninformative prior, we set i ∼ Multinomial([ϕ(k) − w/2, ϕ(k) + w/2]) and (j − i) ∼ Multinomial( [1,5]), where ϕ(k) is the anchor function that uniformly maps each kinetic rate to sequence intervals on the input sequence x; w is the search window size.For instance, for a four-state s 0 , s 1 , s 2 , s 3 KINN with 20bp input sequence x, ϕ(k 01 ) = 5, ϕ(k 12 ) = 10, and ϕ(k 23 ) = 15.The function that maps x i:j to log(k αβ ) is parameterized as a simple one-layer linear convolutional neural network and we search for the convolution kernel sizes.
In the first iteration, the genetic algorithm uniformly samples a number of m = 10 KINNs from the prior probability distributions.The fitness of a KINN is defined by Pearson correlation between the prediction and observations from the validation dataset.The expected fitness for each generation t, C t , is evaluated by the average fitness values of KINNs sampled.Subsequently, KINNs with a fitness higher than C t survive and their model architectures are used to update the posterior distributions.

Benchmarking by simulated kinetic datasets
We tested our method of KINN optimization on simulated data where the kinetic rates' dependence gRNA-DNA alignment was known exactly.We started with an arbitrary sequence of 50 nucleotides.Only a subsequence from the 5th to the 40th nucleotide pairs contribute to the simulated cleavage rate.The 35 nucleotide pairings were separated into seven five-nucleotide long subsequences, each determining the value of a different kinetic rate constant for a 4-state Cas9 cleavage cycle.
Transition rates were calculated from the Arrhenius equation where the factor inside the exponent is negative for forward reactions (α < β) and positive for backwards reactions (α > β).Ω αβ is the set of pair-indices i that contribute to a reaction's free energy barrier.No rates shared nucleotide pair dependence with other rates, i.e., Ω Once the rate dependence was specified, 20000 sequences were randomly generated and the forward and backward rates calculated.From these rates, we constructed the rate matrix from equation ( 2) where D u = 1 nMol is the constant concentration of unbound DNA.The largest, nonzero eigenvalue was deemed the cleavage rate for DNA-gRNA pairs, i.e., what KINNs were trained to predict.

Relation between King-Altman and single rate learning
We observed that training our KA network on single-rate data produces accurate predictions of the kinetic rate constants and activity despite not being at steady state, a requirement for the KA method.This suggests some equivalency between the cleavage rate observed by steady state and uncut DNA-depleting systems.Here we show that, under certain assumptions, the data produced by the two systems are approximately equivalent for training KINNs.
First we construct a kinetic model with a depleting uncut DNA reservoir D u (t) with unnormalized state vector S(t) = (D u , S 1 , S 2 , S 3 ) T .Notice the differences between the matrices (13) and (14).The element (K ′ ) 3,3 = −(k 34 + k 32 ) while (K cas9 ) 3,3 = −(k 30 + k 32 ), there are elements of k 01 S 0 in K ′ instead of k 01 D u as in K cas9 , and finally (K ′ ) 0,3 = 0 while (K cas9 ) 0,3 = k 30 .The first difference is just notational since both rates produce the measurable quantity, cut DNA, and are therefore equivalent k 34 = k 30 = k cut .A similar argument can be made for the second difference assuming that the concentration of unbound Cas9 S 0 remains roughly constant.This allows us to swap D u for S 0 .However, the final difference changes the determinant of the matrix det K ′ ̸ = 0 while det K cas9 = 0.
If we compare the characteristic polynomials C(x; K) of these models we see that C(x; K ′ ) − C(x; K cas9 ) = a 0 (K ′ ) = det K ′ .This is notable because det K ′ = k 01 k 12 k 23 k cut = αβ∈act kαβ s α of the original system.Even more interesting is that a 1 for both systems is the denominator of equation ( 6), i.e.
Therefore, small |λ| values are approximately which is the activity ν given by the King-Altman equations ( 6) and (8).However, for more complicated systems, this approximation may not hold.Therefore, one would utilize an eigendecomposition layer instead of the KA layer to find the largest eigenvalue [53].This approach was tested and showed similar performance to KA layers (Extended Data Figure 1).

Convolutional neural network architecture search with pre-trained KINNs
We developed a new neural architecture search (NAS) method for automatically searching CNNs with pre-trained KINNs to model in vivo kinetic systems.Let κ = log(k) be a vector of log kinetic rate constants from a pretrained KINN in vivo, and k be the kinetic rates in vitro, we seek to learn a modifier term δ ℓ for each kinetic rate k ℓ such that δ ℓ account for the in vivo sequence-context effects and k ′ = exp(κ+δ).These in vivo kinetic rates k ′ are then fed into the downstream KINN layers of King-Altman layer and activity layer.The final binary output of cleavage probability is modeled by a sigmoid layer with a single scalar kinetic activity as input.
We parameterized the functional dependency of δ to the input gRNA and DNA sequences as a deep convolutional neural network backbone f .The input for f is a 13-bit encoding for gRNA-DNA alignment by 25 nucleotides, where we used 4-bit to one-hot encode the gRNA sequence, 4-bit to one-hot encode DNA substitutions, 4-bit to one-hot encode DNA insertions, and 1-bit to encode DNA deletions.The 25 nucleotides consist of 3bp PAM site, 20bp gRNAs, and 2bp for padding DNA insertions.The padding columns are represented as Ns and are encoded as zeros.The output from the CNN is a flatten vector, which will be linearly transformed to δ and followed by log(k ′ ) = κ + δ.The dimensionality of the linear transformation will be determined based on the pretrained KINN and its κ.The pretrained KINNs are converted to special layers with two inputs: one input directly from the gRNA-DNA alignments to compute κ, and the other input δ from the deep CNN backbone.
To search for optimal CNN architecture, we employed AMBER (v0.1.3),an AutoML framework developed previously [22] that demonstrated state-of-the-art NAS performance in genomics.AMBER searches neural network architectures by two components, 1) a controller model that optimizes neural architecture through reinforcement learning and 2) a model space to sample neural architectures.The search and optimization process of neural network architectures is described in length previously [22].
Our model space for the CNN architectures consisted of 7 layers.All layer search spaces besides the first convolution layer, the flatten layer, or the KINN layer, include an identity operator which would effectively remove that layer if selected.This enabled the controller model to reduce model complexity during architecture optimization.The first two layers are convolution operators with different kernel sizes and dilation rates.Alternatively, the controller can select specialized Inception layers that aim to capture patterns at multiple scales by combining kernels of different sizes into a single layer.The convolution layer was followed by a dropout layer to regularize model complexity and improve model generalization, where we searched for its dropout probabilities.The next layer searched for bidirectional LSTM layers with either 32 units or 16 units, with the option of an identity operator.This layer was followed layer with a search space that included a standard flatten layer, an attention-based flatten layer, a dense layer of 64 units, and an identity operator.Finally, we included the KINNs pretrained on the in vitro data, with three different state numbers on two separate datasets, yielding a total of 6 pretrained KINN configurations.

CRISPR/Cas9 datasets and train-testing split
To create the Elektrum model, we divided the CRISPR/Cas9 cleavage datasets into two categories.The first category consisted of time-resolved kinetic assays measured in vitro.The second category consisted of in vivo off-target editing measurement using high-throughput assays.We described the processing and train-test split for each category below.
For the in vitro data, we re-compiled the Massively parallel kinetic profiling (MPKP) data for Cas9 generated previously [21].The MPKP measured the cleavage rate of SpCas9 on 14,000 targets containing mismatches, insertions, and deletions relative to two different gRNA templates.For each gRNA-DNA pair, we converted their sequences as described in the previous section to as a 13-bit by 25-bp encoding matrix.For each gRNA sequence, we trained separate models using all data points, while using 50% of the other sgRNA datapoints as validation data in order to search for neural network architectures.The remaining 50% of the other sgRNA data points are held out as testing data.
For the in vivo data, we used the assembled training dataset from CRISPR-Net [30].Briefly, this training data was collected and uniformly processed from 8 datasets in previously published Cas9 off-target assays [23,[31][32][33][34][35].In this data, the negative data points (that is, inactive off-target sites) were constructed by searching genome-wide DNAs by Cas-Offinder [54], a computational tool that searches for potential off-target sites of Cas9.To ensure a fair comparison, we held out the same three testing datasets to match the original train-test split as in CRISPR-Net.In total, the testing dataset has 693,235 gRNA-DNA pairs.For the remaining data, we randomly split the dataset by gRNA sequences, holding out 20% of the gRNA and all their corresponding gRNA-DNA pairs as validation datasets.This corresponds to a total of 842,298 gRNA-DNA pairs for training and 269,900 gRNA-DNA pairs for validation.

Neural architecture analysis for KINN and Elektrum
To evaluate the neural architectures created from the KINN search on the Cas9 MPKP assays and account for the stochasticity between different search runs, we developed a Baysian hierarchical model to infer the maximum a posteriori estimate for neural architecture parameters.The Bayesian hierarchical model allowed us to consider the uncertainty and covariance in a principled framework.Specifically, we employed a multivariate normal (MVN) distribution to model the group-level KINN architecture parameters.The MVN's mean followed an uninformative prior.Its variance-covariance matrix was defined by a distribution over Cholesky decomposed covariance matrices, such that the underlying correlation matrices followed an LKJ distribution [55] with a standard deviation η = 1.We assumed each replicate run draws a set of architecture parameters from the group-level MVN distribution, such that the optimized KINNs follows another sample-level MVN with its mean determined by the group-level sampled parameters.The covariance matrix of the sample-level MVN also followed an uninformative prior of an LKJ distribution with the standard deviations η = 1.The top 5% performing architectures from each run were fed into the Bayesian model for parameter fitting.We implemented the above model using PyMC3 [56].
For the transfer learning NAS selection probability, we simply pooled the last five architectures from each search run across five independent search runs.This was done because each layer was modeled by a recurrent neural network from its preceding layers without complex multivariate dependencies.We then computed the mean and variance of the selection likelihood for a candidate layer within the given model space.

Base-pair importance interpretation by SHAP
In order to determine the base-pairs that are important in explaining the difference between in vitro and in vivo cleavage predictions, we applied SHAP analysis to both the base KINN and the trained Elektrum model.Both KINN and Elektrum were endto-end differentiable, which allowed us to use the GradientExplainer to assess the importance of each base-pair within each model.GradientExplainer seeks to interpret a model using expected gradients, which is an extension of integrated gradients [43].We then normalized the maximum absolute base-pair importance to 1, making the importance scores comparable across different predictions.After obtaining the normalized base-pair importance scores for both in vitro and in vivo cleavage predictions, we subtracted the in vitro scores from the in vivo scores to determine the specific importance of each base-pair in explaining the difference between in vitro and in vivo cleavage.In this final base-pair importance matrix, if a base-pair contribution is identical in vivo vs in vitro, it's score would be zero; in contrast, a positive importance score indicates the activation of cleavage in vivo specifically, and a negative value indicates the inactivation in vivo specifically.To compute the integrated gradient values, we used an all-zero matrix of shape 25 by 13 as the background.Finally, using the five models constructed by independent AMBER runs, our interpretation method of basepair effects considers both the averaged SHAP values as well as its variance across models.

Data Availability
All training and testing data presented in this paper are publicly available.The in vitro kinetic data was downloaded from Jones et al. [21] in URL https://github.com/finkelsteinlab/nucleaseq/tree/master.The in vivo CIRSPR/Cas9 off-targeting dataset was downloaded from CRISPR-Net [30] in URL https://codeocean.com/capsule/9553651/tree/v2.

Fig. 1
Fig.1Overview of Elektrum framework.A) We use a probabilistic model-building genetic algorithm to search for and create a set of Kinetically Interpretable Neural Networks (KINNs), a special type of sparse neural networks that has one-to-one correspondence with kinetic models.These KINNs differ in number of enzymatic states and reaction rates but model the in vitro data equally well.B) KINNs can be translated to a system of first-order rate equations.For CRISPR-Cas9, this corresponds to a multi-step enzymatic reaction, where the kinetic rate for each enzymatic step is determined by the DNA-gRNA sequence pair.C) Transfer learning integrates KINNs pretrained on in vitro data with a deep neural network to predict in vivo Cas9 cleavage.We employ neural architecture search (NAS) to optimize the convolutional and bidirectional layer while searching for an optimal static KINN layer.This integrative approach expands the types and amount of applicable training datasets by many fold while maintaining kinetic interpretability.

CFig. 4
Fig. 4 Interpreting sequence context for in vivo Cas9 cleavage rate prediction.The comparison of the full Elektrum in vivo predicted cleavage rates versus the KINN in vitro predicted cleavage rates in A) Listargen data [31] and B) Kleisitiver data [35].Both datasets were held-out during the training of Elektrum and KINN.Boxplot elements: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.C) SHAP analysis reveals the presence of guanines in the Cas9 target regions is important to increased cleavage in vivo.Shown are two gRNA-DNA pairs with the largest predicted cleavage rate difference between in vivo and in vitro.The gRNA sequence is shown on the x-axis.On the y-axis, the first four rows represent one-hot encoded gRNA sequence, and the last four rows represent the observation of nucleotide substitutions at specific locations in the target DNA sequence.Color represents the average SHAP values across five best models created independently in five search runs; size of the dots scale inversely with variance of SHAP across the five models.
the difference in free energy contributed by the ith DNA-gRNA nucleotide pair for the transition α → β.For this data, the values were taken from the set ∆ Ḡαβ (x i , ξ i ) ∈ {−1, −0.1, 0.1, 1}.Complementary pairs were assigned a value of −1 but mismatched pairs varied for different transitions.All rates were multiplied by a baseline kinetic rate k o = .01sec −1 .Rate constant k 01 is technically multiplied by k o = .01sec −1 nMol − 1 to maintain proper units.